Analyzing Data: Conditional Means

The conditional mean will be our first predictive algorithm. Conditional means answer the question: “Given what we know about a certain case, what can we expect to see, on average?” The conditional mean is a powerful tool that is typically quite easy to explain to decision-makers.

We’ll go through the following steps:

  1. Computing and plotting unconditional means.
  2. Computing and plotting conditional means using a single predictor.
  3. Computing and plotting conditional means using multiple predictors.
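Before turning to the county data, here is a minimal sketch of a conditional mean using R’s built-in mtcars data: average fuel economy conditional on the number of cylinders.

```r
## A toy conditional mean: average mpg at each level of cyl
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))
```

Each row of the result is a prediction: if all we know about a car is its cylinder count, the group mean is our best guess for its mpg.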

Dataset for this week

We will be working with a dataset put together by the Census Bureau that summarizes the characteristics of the 3,088 counties in the United States.

## Load libraries: tidyverse for data wrangling and plotting,
## plotly for interactive graphics, Metrics for RMSE
library(tidyverse)
library(plotly)
library(Metrics)

load("pd.Rdata")
pd%>%filter(county=="Los Angeles County, CA")%>%select(percapinc.2012)
## # A tibble: 1 x 1
##   percapinc.2012
##            <dbl>
## 1          44474

The codebook for this dataset is stored as another dataset, lab_explain. The first column in this dataset contains variable names; the second column gives a full explanation of each variable.

## Full explanation of data in codebook
load("pd_lab_explain.Rdata")

#or use View
View(lab_explain)
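To look up a single variable, we can filter the codebook directly. Note that the column names used below (variable and label) are placeholders, not confirmed by the source — run names(lab_explain) to see the actual column names.

```r
## Look up the codebook entry for one variable.
## NB: "variable" and "label" are assumed column names --
## check names(lab_explain) for the real ones.
lab_explain %>%
  filter(variable == "percapinc.2010") %>%
  select(label)
```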

Filter and select

pd%>%filter(county=="Los Angeles County, CA")%>%
  select(county,percapinc.2012)
## # A tibble: 1 x 2
##   county                 percapinc.2012
##   <chr>                           <dbl>
## 1 Los Angeles County, CA          44474
pd%>%filter(coll_grad_pc>30)%>%
  select(county,coll_grad_pc,percapinc.2012)%>%
  arrange(-percapinc.2012)
## # A tibble: 339 x 3
##    county                   coll_grad_pc percapinc.2012
##    <chr>                           <dbl>          <dbl>
##  1 New York County, NY              58.1         119347
##  2 Marin County, CA                 54.6          93407
##  3 Teton County, WY                 49            93194
##  4 Nantucket County, MA             43.6          83634
##  5 Arlington County, VA             71.2          83242
##  6 Haines Borough, AK               33            82323
##  7 Fairfield County, CT             44.6          81068
##  8 Alexandria city, VA              60.5          80952
##  9 Pitkin County, CO                55.3          80066
## 10 San Francisco County, CA         52            80014
## # … with 329 more rows

Quick Exercise: What’s per capita income in Los Angeles County, CA in 2012?

Dependent Variable

Our working example will be based on predicting income in a given county. Suppose we want to know what income level we can expect for a geographic area based on observed characteristics, such as the proportion of the population with a bachelor’s degree. How would we predict the income based on what we know about the geographic area?

Let’s begin by plotting the data to see what it looks like. To do this I need to first rank the counties by income. To create a rank variable that will be stored in the pd dataset, I use the mutate command. This creates a variable based on some calculation and then stores it in the same dataset. I’m then going to plot incomes for each county in descending rank order. Using the plotly library I can make this interactive so we know which counties we’re talking about.

This code creates a new variable called percapinc_rank, which ranks all counties on the basis of their 2010 per capita income.

## Create a rank variable for income 
pd<-pd%>%mutate(percapinc_rank=rank(percapinc.2010))

The next code will create a graphic for us. We will be using ggplot to make all of our graphics in this class. The first step in ggplot is to create a graphics object. We could name it anything; I’m just going to name it gg. Within this object, I declare the data (pd), the x variable (percapinc_rank), and the y variable (percapinc.2010). I also note that I want to use the county name (county) as text.

## Plot by rank
gg<-ggplot(data=pd, aes(x=percapinc_rank,
                         y=percapinc.2010,
                         text=county))

Now I need to declare the type of graphic, or geometry. By specifying geom_point I’m saying I want a scatterplot.

## Add Points
gg<-gg+geom_point(alpha=.5,size=.5)

Now I’m going to add labels for the x and y axis.

## Add labels
gg<-gg+xlab("Rank")+ylab("Per Capita Income, 2010")

And now we’re ready to call the graphics object, gg.

gg

I’m going to store this gg object in another object called gg1. This is so I can come back to it later.

gg1<-gg
# Make Interactive plot
gg_p<-ggplotly(gg)

gg_p

Unconditional Means

If you were asked to predict the income for a given area without any additional information, the likely best guess is the overall average. We’re going to begin with the unconditional mean, or simple average, as our first prediction. We’ll again use the mutate command to plug in a variable that will be the average for every county, and we’ll plot this as a predictor.

Our notation for the unconditional mean as a predictor is:

\[\hat{Y}=\bar{Y} \]

What this says is that our predicted income \(\hat{Y}\) is equal to average income \(\bar{Y}\). We’ll always use \(Y\) as the notation for a dependent variable and \(X\) as the notation for an independent variable.

##Unconditional Average
pd%>%summarize(mean_percapinc.2010=mean(percapinc.2010,na.rm=TRUE))
## # A tibble: 1 x 1
##   mean_percapinc.2010
##                 <dbl>
## 1              34003.
##Unconditional Average as a Predictor
pd<-pd%>%mutate(mean_percapinc.2010=mean(percapinc.2010,na.rm=TRUE))
##Plotting
gg<-ggplot(data=pd,aes(y=percapinc.2010,x=percapinc_rank,color="Actual"))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+geom_point(aes(y=mean_percapinc.2010,x=percapinc_rank,
                  color="Predicted: Unconditional Mean"),
                  size=.5)
gg<-gg+xlab("Rank of Per Capita Income")+ylab("Per Capita Income")
gg<-gg+scale_color_manual(name="Type",
                          values=c("Actual"="black",
                          "Predicted: Unconditional Mean"="blue")
                          )
gg<-gg+theme(legend.position="bottom")

gg

##Save for later

gg2<-gg

This is of course a terrible prediction. In the absence of any other information, it’s often the best we can do, but we really ought to be able to do better.

To understand how far off we are, we need to summarize our errors. We will use several different ways of doing this over the semester, but let’s start with a very standard one: Root Mean Squared Error, or RMSE. An error is the vertical distance between each point and its prediction. The RMSE is the square root of the mean of the squared errors. (Q: why do we square them?)

\[RMSE(\hat{Y})=\sqrt{ 1/n \sum_{i=1}^n(Y_i-\hat{Y_i})^2} \]
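As a check on our understanding, the formula can be computed by hand before we hand the job off to a package: square the errors, take their mean, then take the square root. This sketch reuses the mean_percapinc.2010 variable created above.

```r
## RMSE computed directly from the formula
rmse_by_hand <- sqrt(mean((pd$percapinc.2010 - pd$mean_percapinc.2010)^2,
                          na.rm = TRUE))
rmse_by_hand
```

This should agree with the result from the Metrics package below.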

The error term for our prediction using unconditional means will be stored in the variable \(e1\). This variable will be equal to the actual value of per capita income percapinc.2010 minus the mean value of per capita income mean_percapinc.2010.

pd<-pd%>%mutate(e1=percapinc.2010-mean_percapinc.2010)

To calculate the root mean squared error, we use the rmse function from the Metrics library. The code below calculates and displays the RMSE.

## RMSE

rmse_uncond_mean<-rmse(pd$percapinc.2010,pd$mean_percapinc.2010)

rmse_uncond_mean
## [1] 7817.037

What this means is, on average, we are off by 7817.04, which is a lot!

##Conditional Means With One Predictor Variable

To incorporate additional information into the mean, we need to calculate averages at different levels of other predictors. Let’s calculate the average per capita income at different levels of college education. The code below will calculate average income across counties at four different levels of college education: the four quartiles of college education in the dataset.

##Conditional Average across a single variable

## Create a variable for quartiles of college education
pd<-pd%>%mutate(coll_grad_level=ntile(coll_grad_pc,4))

pd%>%select(county,coll_grad_pc,coll_grad_level)%>%View()

table(pd$coll_grad_level)
## 
##   1   2   3   4 
## 772 772 772 772

pd<-pd%>%group_by(coll_grad_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor
  mutate(pred_income_college=mean(percapinc.2010))%>%
  ## Ungroup
  ungroup()%>% 
  #Rank by prediction, with ties sorted randomly
  mutate(pred_income_college_rank=rank(pred_income_college,ties.method="random"))

pd%>%select(county,coll_grad_pc,coll_grad_level,pred_income_college)%>%View()

To visualize this we can use a graphic similar to the one before:

pd%>%group_by(coll_grad_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor
  summarise(pred_income_college=mean(percapinc.2010))
## # A tibble: 4 x 2
##   coll_grad_level pred_income_college
##             <int>               <dbl>
## 1               1              28056.
## 2               2              32371.
## 3               3              35292.
## 4               4              40293.
gg<-ggplot(data=pd,aes(x=pred_income_college_rank,y=percapinc.2010,color="Actual"))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+geom_point(aes(x=pred_income_college_rank,y=pred_income_college,color="Predicted:Conditional Mean, 1 var"))
gg<-gg+ scale_color_manual("Type",values=c("Predicted:Conditional Mean, 1 var"="red","Actual"="black"))
gg<-gg+theme(legend.position="bottom")
gg<-gg+xlab("Rank")+ylab("Per Capita Income, 2010")
gg

##Save for later
gg3<-gg

Our notation for this predictor will be:

\[\hat{Y}=(\bar{Y}|X=x) \]

That is, the predicted value of \(Y\), written \(\hat{Y}\), is equal to the mean value of \(Y\) given that our predictor \(X\) (college graduate levels in this case) is equal to a particular value of \(X\), denoted by \(x\).
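As a sketch of what this notation means in practice, the predicted value for any county in the top quartile of college education is simply the mean income within that group:

```r
## Conditional mean when X = 4 (top quartile of college education)
pd %>%
  filter(coll_grad_level == 4) %>%
  summarize(pred_income = mean(percapinc.2010, na.rm = TRUE))
```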

Let’s see what happened to our RMSE when we did a conditional as opposed to an unconditional mean.

Remember that our RMSE using the unconditional mean was 7817.04.

rmse_cond_mean_one<-rmse(pd$percapinc.2010,pd$pred_income_college)
rmse_cond_mean_one
## [1] 6425.743

Quick Exercise: Calculate per capita income as a function of the proportion of the county with a high school education

We can continue “binning” the data to define averages by more and more subgroups. For instance, we might want to calculate the average income by education AND home ownership rate.

New Variable: Home Ownership Rate

## Create a variable for quartiles of home ownership
pd<-pd%>%mutate(homeown_rate_level=ntile(homeown_rate,4))
pd%>%group_by(homeown_rate_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor
  summarise(pred_income_homeown_rate=mean(percapinc.2010))
## # A tibble: 4 x 2
##   homeown_rate_level pred_income_homeown_rate
##                <int>                    <dbl>
## 1                  1                   34925.
## 2                  2                   33245.
## 3                  3                   33528.
## 4                  4                   34314.
pd<-pd%>%group_by(coll_grad_level,homeown_rate_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor
  mutate(pred_income_coll_and_homeown=mean(percapinc.2010))%>% 
  ## Ungroup
  ungroup()%>% 
  #Rank by prediction, with ties sorted randomly
  mutate(pred_income_coll_and_homeown_rank=rank(pred_income_coll_and_homeown,
                                                ties.method="random"))
rmse_cond_mean_two<-rmse(pd$percapinc.2010,pd$pred_income_coll_and_homeown)
rmse_cond_mean_two
## [1] 6371.481
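To see the payoff from conditioning, we can collect the RMSEs for the predictors so far in one small table — a sketch that reuses the prediction columns already stored in pd:

```r
## Compare RMSE across predictors
tibble(model = c("Unconditional mean",
                 "Conditional: college (1 var)",
                 "Conditional: college + homeownership (2 vars)"),
       rmse = c(rmse(pd$percapinc.2010, pd$mean_percapinc.2010),
                rmse(pd$percapinc.2010, pd$pred_income_college),
                rmse(pd$percapinc.2010, pd$pred_income_coll_and_homeown)))
```

The RMSE falls as we condition on more information, though the gain from the second predictor is modest.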
This code answers the quick exercise above, conditioning on quartiles of high school completion:

pd<-pd%>%mutate(hs_grad_level=ntile(hs_grad_pc,4))
pd%>%select(county,hs_grad_pc,hs_grad_level)%>%View()
table(pd$hs_grad_level)
## 
##   1   2   3   4 
## 772 772 772 772
pd<-pd%>%group_by(hs_grad_level)%>%
  mutate(pred_income_hs=mean(percapinc.2010))%>%
  ungroup()%>%
  mutate(pred_income_hs_rank=rank(pred_income_hs,ties.method="random"))
rmse_cond_mean_hs<-rmse(pd$percapinc.2010,pd$pred_income_hs)
rmse_cond_mean_hs
## [1] 6661.028